Reducing uncertainty about common-mode failures

Authors

  • Jeffrey M. Voas
  • Anup K. Ghosh
  • Frank Charron
  • Lora Kassab
Abstract

Multi-version programming is employed in fault-tolerant computer systems to protect against common-mode failure in software. It involves building diverse software implementations of critical functions. The premise of building diverse versions is that a programming error in one version is less likely to cause a failure in the same manner as an error in another version. Skeptics of multi-version programming have correctly pointed out that common-mode failures between redundant diverse versions can reduce the return on investment in creating diverse versions. To date, other than using historical data from other projects, there has been no way to estimate the potential for a given multi-version programming system to suffer a common-mode failure. This paper presents an algorithm and software analysis prototype to reduce the uncertainty of whether software flaws in diverse versions can result in common-mode failure. The analysis uses software fault-injection techniques to subject one or more versions to anomalous behavior. From this, we can predict how the software will behave if real faults exist in the multiple versions.

1. Diversity and Various Perspectives

Although software systems made of redundant software programs are fairly uncommon in the United States, they are looked upon more favorably elsewhere. Airbus Industrie, the European consortium that competes directly with Boeing Co., uses diverse software programs for the A320/A330/A340 electrical flight control systems. For example, Airbus uses two different types of computers for flight control in the A320. The two computers, known as SEC and ELAC, are designed and manufactured by different equipment manufacturers using different microprocessors, different computer architectures, and different functional specifications. Each flight control computer uses one channel for control and another channel for monitoring.
Since a different software program is used for each of these channels on each redundant computer, a total of four different software packages are used in the control and monitoring of the A320 flight control system [15][4]. By achieving diversity in hardware and software, Airbus hopes to mitigate the common-mode failure problem in redundant computer systems.

Redundancy is also prevalent in nuclear power systems. Digital instrumentation and control systems in nuclear power plants employ independent protection systems to detect system failures in order to isolate and shut down failed subsystems. The U.S. Nuclear Regulatory Commission (NRC) has developed a position with respect to diversity, as stated in the technical position document “Digital Instrumentation and Control Systems in Advanced Plants” [13]. Two excerpts from this document are particularly relevant for requiring the assessment of common-mode failures:

  1. The applicant shall assess the defense-in-depth and diversity of the proposed instrumentation and control system to demonstrate that vulnerabilities to common-mode failures have been adequately addressed. The staff considers software design errors to be credible common-mode failures that must be specifically included in the evaluation.
  2. In performing the assessment, the vendor or applicant shall analyze each postulated common-mode failure for each event that is evaluated in the analysis section of the safety analysis report (SAR) using best-estimate methods. The vendor or applicant shall demonstrate adequate diversity within the design for each of these events.

The Canadian Atomic Energy Control Board (AECB) recognizes the danger of common-mode failures in nuclear control applications as well.
The AECB requirements for achieving software diversity are succinctly stated in draft guide C-138, “Software in Protection and Control Systems” [1]:

“To achieve the required levels of safety and reliability, the system may need to be designed to use multiple, diverse components performing the same or similar functions. For example, AECB Regulatory Documents R-8 and R-10 require two independent and diverse protective shutdown systems in Canadian nuclear power reactors. It should be recognized that when multiple components use software to provide similar functionality, there is a danger that design diversity may be compromised. The design should address this danger by enforcing other types of diversity such as functional diversity, independent and diverse sensors, and timing diversity.”

Clearly these two nuclear regulatory bodies have recognized common-mode failures as a critical weakness in redundant component implementations of nuclear control systems. It is interesting to note that they recommend preventative design measures as well as evaluative measures to address the common-mode failure problem.

From industry recommendations, the U.S. Federal Aviation Administration (FAA) has formulated a different perspective on redundancy. Their position is that since the degree of protection afforded by design diversity is not quantifiable, employing diversity will only be counted as additional protection beyond the already required levels of assurance [6]:

“The degree of dissimilarity and hence the degree of protection is not usually measurable. Probability of loss of system function will increase to the extent that the safety monitoring associated with dissimilar software versions detects actual errors or experiences transients that exceed comparator threshold limits. Multiple software versions are usually used, therefore, as a means of providing additional protection after the software verification process objectives for the software level have been satisfied.”

The Office of Device Evaluation of the Center for Devices and Radiological Health of the U.S. Food and Drug Administration (FDA) has issued a report that applies to the software aspects of pre-market notification submissions for medical devices [7]. The FDA does not dictate any particular approach to safety, nor does it dictate specific software quality assurance and development procedures. Because there is no specification of how safety is to be achieved or demonstrated, the FDA provides no guidance on redundancy and diversity.

2. Software Composition

Fault-tolerant computer systems that use redundancy are employed in a wide range of safety-critical and ultra-reliable applications, including nuclear control, flight control, and medical devices. In systems where software is replicated on redundant platforms, a failure resulting from a flaw in software on one platform is certain to recur on the other redundant platforms for the same input, since the replicated software has replicated flaws. In these types of fault-tolerant architectures, redundant hardware platforms only protect the system from anomalous or transient errors resulting from hardware faults or external corruptions. The reliability afforded such a system by the software can be modeled as a series reliability block diagram, because the reliability of the multiple software versions is only as good as the reliability of a single version.

In order to provide some level of protection against redundant software programs failing identically, diverse multi-programming, also known as N-version programming, has been advocated [2][3]. Other references that more fully discuss N-version programming are [11] and [12]. In N-version programming, different software versions, written to the same specification but developed independently, are run in parallel.
If there is no correlation between version failures, then the system’s dependability is essentially the product of version dependabilities, which has the potential for construction of nearly perfect systems from imperfect programs.

Diversity in program versions attempts to prevent redundant programs from failing identically, or from failing simultaneously. To protect against common design errors, diversity in design is employed. Functional diversity involves specifying that different programs have different functional requirements. For example, one program might do a linear search, while another performs a binary search. The goal of finding the element in a list might be the same, but the algorithm specified will be different. The techniques specified in this paper are well-suited to both types of software diversity.

Despite the diverse software implementations, it is believed that common errors compromise the independence between multiple versions that is needed to make diversity worthwhile [9][5]. By definition, a common-mode failure (CMF) occurs when two or more software versions fail in exactly the same way for the same input. Common-mode failures are said to occur when there exists at least one input combination for which the outputs of two or more versions are erroneous, and the outputs are identical for all possible input combinations [8]. Thus, if two or more versions respond to all inputs in the same way, and there is at least one input combination that causes them both to respond incorrectly, then a common-mode failure has occurred. Since it is a practical impossibility to test that all versions of a program will respond identically to all inputs, we loosen the definition for common-mode failure by defining a “single input” common-mode failure. A single input CMF occurs when there exists at least one input for which the outputs of two or more versions of a program are identically incorrect.
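
The single input definition can be checked directly when a trusted reference result is available. The following is a minimal sketch of such a check, not the paper's analysis prototype. The function name `single_input_cmfs`, the `oracle` argument, and the three absolute-value versions are hypothetical and exist only for illustration; outputs are assumed to be hashable.

```python
def single_input_cmfs(versions, oracle, inputs):
    """Report single-input CMFs: inputs for which two or more versions
    produce the same incorrect output.

    Returns a list of (input, shared_wrong_output, version_indices).
    `oracle` is an assumed trusted reference for the correct output.
    """
    found = []
    for x in inputs:
        expected = oracle(x)
        wrong = {}  # incorrect output -> indices of versions that produced it
        for index, version in enumerate(versions):
            out = version(x)
            if out != expected:
                wrong.setdefault(out, []).append(index)
        for out, indices in wrong.items():
            if len(indices) >= 2:  # identically incorrect in 2+ versions
                found.append((x, out, indices))
    return found


# Three hypothetical versions of an absolute-value function.
def version_a(x):
    return x if x >= 0 else -x   # correct


def version_b(x):
    return x if x > -2 else -x   # seeded flaw: wrong for x == -1


def version_c(x):
    return x if x >= -1 else -x  # different seeded flaw, same wrong result


print(single_input_cmfs([version_a, version_b, version_c], abs, range(-3, 4)))
# → [(-1, -1, [1, 2])]
```

Versions that are wrong in different ways for the same input are not reported, since their outputs are not identically incorrect.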
This definition loosens the restriction that the outputs for all inputs must also be identical. Note that CMFs do not have to occur only from identical faults; however, that class of problem is the predominant cause of CMFs. In this paper, we develop techniques that allow prediction of CMFs that result not only from identical faults, but also from uncorrelated anomalies in different combinations of versions.

Knight and Leveson demonstrated that different programmers can make the same logical error [9]. An additional result involved cases where different logical errors yielded common-mode failures in completely distinct algorithms or in different parts of similar algorithms. The technique presented here is concerned with these “different logical errors”. The technique can determine if a flaw in one function in one version can result in a common-mode failure with a flaw in a different function in another version. The technique can be applied equally to faults in similar functions as well, though this outcome has less value since the likelihood of CMFs is greater for faults in similar functions.

The goal of this analysis is to provide both an indication of the potential for common-mode failures to occur and the statements where faults that cause CMFs could hide. This information, in turn, can be used by developers of multi-version fault-tolerant systems to make them more robust against common-mode failures.

3. Assessing the Likelihood of Common-mode Failures

Recognizing the importance of predicting the potential for common-mode failures, we have developed an algorithm and a prototype software analysis tool to observe common-mode failures produced by combinations of simulated programmer faults. Consider the simple N-version system depicted in Figure 1. This figure illustrates an architecture of an N-version system that is composed of N independent programs executing identical inputs sampled from the input space.
Each program consists of four functions, labeled α through δ, that were coded from a common specification.

To achieve fault tolerance through replication, each version in Figure 1 would be an identical replica. In this case, software flaws in one version will be present in each of the other versions. An input that triggers a flaw in one version and ultimately causes a faulty output will guarantee a common-mode failure between the identical versions. Fault tolerance through N replicated versions can only be achieved when faults that affect one version do not affect another, e.g., a hardware fault or external corruption.

To achieve fault tolerance through diversity, each version implements the individual functions in a diverse manner. The rationale behind programming diverse versions is to prevent programming errors in different versions from resulting in common failures. For example, system fault ...

Figure 1: N-version system architecture (input space A, Version 1 through Version N, voter, system).
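
The idea of simulating faults in different functions of different versions and watching for common-mode failures can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' algorithm or prototype, whose details are not given in this excerpt. The two four-stage versions, the shared specification `spec`, the `+1` perturbation, and all function names are hypothetical.

```python
from itertools import product


def clamp(v):
    return max(0, min(10, v))


def spec(x):
    """Assumed common specification: clamp(2x + 3) to the range [0, 10]."""
    return clamp(2 * x + 3)


# Two hypothetical diverse versions, each a pipeline of four functions.
VERSION_1 = [lambda x: 2 * x, lambda x: x + 3, clamp, lambda x: x]
VERSION_2 = [lambda x: x + 1, lambda x: 2 * x, lambda x: x + 1, clamp]


def run(stages, x, faulty_stage=None, perturb=lambda v: v + 1):
    """Run a version; corrupt the output of one stage to simulate a fault."""
    for i, stage in enumerate(stages):
        x = stage(x)
        if i == faulty_stage:
            x = perturb(x)  # injected anomaly (assumed perturbation)
    return x


def cmf_inputs(stages_a, fault_a, stages_b, fault_b, oracle, inputs):
    """Inputs on which both perturbed versions are identically incorrect."""
    hits = []
    for x in inputs:
        out_a = run(stages_a, x, fault_a)
        out_b = run(stages_b, x, fault_b)
        if out_a == out_b and out_a != oracle(x):
            hits.append(x)
    return hits


def survey(stages_a, stages_b, oracle, inputs):
    """Try every pair of fault locations and record the CMF inputs."""
    inputs = list(inputs)
    locations = product(range(len(stages_a)), range(len(stages_b)))
    return {(i, j): cmf_inputs(stages_a, i, stages_b, j, oracle, inputs)
            for i, j in locations}


results = survey(VERSION_1, VERSION_2, spec, range(-3, 6))
print(results[(1, 2)])  # → [-1, 0, 1, 2, 3]
print(results[(1, 0)])  # → [3]
```

In this toy, faults injected at locations (1, 2) corrupt the same intermediate value in both versions, so they agree on a wrong answer for most inputs. Faults at locations (1, 0) corrupt different values, yet still coincide on one input because clamping maps both wrong values to the same output. Location pairs with empty lists are those where the injected anomalies were masked or produced differing outputs on this input sample.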


Similar resources

MTBF evaluation for 2-out-of-3 redundant repairable systems with common cause and cascade failures considering fuzzy rates for failures and repair: a case study of a centrifugal water pumping system

In many cases, redundant systems are beset by both independent and dependent failures. Ignoring dependent variables in MTBF evaluation of redundant systems hastens the occurrence of failure, causing it to take place before the expected time, hence decreasing safety and creating irreversible damages. Common cause failure (CCF) and cascading failure are two varieties of dependent failures, both l...


Optimization the Availability of a System with Short Circuit and Common Cause ‎Failures‎

The redundancy allocation problem is one of the most important problems in the reliability area. In this problem, the reliability and availability of systems are maximized by allocating redundant components to subsystems. A system operates normally in its operational mode but fails in either open or shorted modes. This paper presents a repairable k-out-of-n system network model with common cause fa...


Analysis of Vector Estimating Modulation Method to Eliminate Common Mode Voltage

The problem of common mode voltage in inverters can be considered a major issue that leads to motor bearing failures. To eliminate these voltages, proposing some methods seems to be necessary. This paper presents a comparative study of estimating modulation methods for eliminating common mode voltage. The main idea of these methods is based on generation of reference vector with nearest ...


Common Cause Failures and Ultra Reliability

A common cause failure occurs when several failures have the same origin. Common cause failures are either common event failures, where the cause is a single external event, or common mode failures, where two systems fail in the same way for the same reason. Common mode failures can occur at different times because of a design defect or a repeated external event. Common event failures reduce th...


Space Vector Pulse Width Modulation with Reduced Common Mode Voltage and Current Losses for Six-Phase Induction Motor Drive with Three-Level Inverter

Common-mode voltage (CMV) generated by the inverter causes motor bearing failures in multiphase drives. On the other hand, the presence of undesired z-component currents in a six-phase induction machine (SPIM) leads to extra current losses and has to be considered in pulse width modulation (PWM) techniques. In this paper, it is shown that the presence of z-component currents and CMV in six phase driv...


Detecting Common Mode Failures in N-Version Software Using Weakest Precondition Analysis

An underlying assumption for the N-version programming technique is that independently developed versions would fail in a statistically independent manner. However, empirical studies have demonstrated that common mode failures can occur even for independently developed versions, and that common mode failures degrade system reliability. In this paper, we demonstrate that the weakest precondition ana...



Publication date: 1997